Intro¶
Neural style transfer is the process of generating a new image that combines the content of one image with the style of one or more others. It is an ill-posed problem: there is no single correct output, and traditional supervised learning cannot be readily applied because it would require pairs of content images and their stylized counterparts, which is impractical to collect. I propose to address this problem by modifying existing methods, including:
- A Neural Algorithm of Artistic Style by Gatys et al.
- A Learned Representation for Artistic Style by Dumoulin et al.
A Neural Algorithm of Artistic Style¶
Gatys et al. published one of the first seminal works to use neural networks to solve this problem. The most important idea here is "representations of content and style in the Convolutional Neural Network are separable". They use the following definitions to remove ambiguity from the problem.
- Two images are similar in content if their high-level features as extracted by a trained classifier are close in Euclidean distance.
- Two images are similar in style if their low-level features as extracted by a trained classifier share the same statistics or, more concretely, if the difference between the features’ Gram matrices has a small Frobenius norm.
Feature correlations among the feature maps are given by the Gram matrix: for a layer $l$, $G_{i j}^l$ is the inner product between the vectorized feature maps $i$ and $j$ $$ G_{i j}^l=\sum_k F_{i k}^l F_{j k}^l $$
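As a small sketch (the function and variable names are my own), the Gram matrix of a conv layer's output can be computed in PyTorch by flattening the spatial dimensions and taking inner products between channels:

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    # features: (batch, channels, height, width) output of a conv layer
    b, c, h, w = features.size()
    # vectorize each feature map: one row per channel
    f = features.view(b * c, h * w)
    # inner products between all pairs of vectorized feature maps
    g = f @ f.t()
    # normalize by the number of elements so the loss scale is
    # independent of the feature map size
    return g / (b * c * h * w)
```

The normalization is a common practical choice so that style losses from layers of different sizes are comparable.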
The objective is to minimize a weighted sum of the style and content losses; the desired stylized image is found by running gradient descent on this objective.
Given content image $\vec{p}$ and style image $\vec{a}$, target image $\vec{x}$, the loss function is
$$\mathcal{L}_{\text {total }}(\vec{p}, \vec{a}, \vec{x})=\alpha \mathcal{L}_{\text {content }}(\vec{p}, \vec{x})+\beta \mathcal{L}_{\text {style }}(\vec{a}, \vec{x})$$
Limitations¶
- Prohibitively expensive, especially for high-resolution images: the task is modelled as an optimization problem, requiring hundreds or thousands of gradient descent iterations per image.
- The authors provide little evidence that the method extends to multi-style transfer.
Despite its flaws, the algorithm is very flexible. Therefore, I decided to first explore this method to solve this problem.
Modified Algorithm¶
I simply replace the style loss with a weighted sum of loss terms, one per style.
$$\mathcal{L}_{\text {total }}(\vec{p}, \vec{a}, \vec{x})=\alpha \mathcal{L}_{\text {content }}(\vec{p}, \vec{x})+\sum_{i=1}^{N}\beta_i \mathcal{L}_{\text {style}_i}(\vec{a}_i, \vec{x})$$
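A minimal sketch of this weighted objective (all names are illustrative; `content_loss` and `style_losses` are assumed to be scalar tensors already computed from VGG features):

```python
import torch

def total_loss(content_loss: torch.Tensor,
               style_losses: list,
               alpha: float,
               betas: list) -> torch.Tensor:
    # weighted sum: alpha * L_content + sum_i beta_i * L_style_i
    assert len(style_losses) == len(betas)
    loss = alpha * content_loss
    for beta_i, style_loss_i in zip(betas, style_losses):
        loss = loss + beta_i * style_loss_i
    return loss
```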
Pretrained model: VGG19¶

VGG19, pretrained on ImageNet, is used as a feature extractor; it achieved state-of-the-art classification accuracy at the time. Since we only need features, not class predictions, the fully connected layers can safely be removed. The outputs of the individual convolution layers are used to compute the content and style losses.
The object information becomes increasingly explicit along the processing hierarchy. Detailed pixel information is lost while the high-level content of the image is preserved. Therefore, the conv_4 or conv_5 filters can be used to find the content loss.
The style representation is computed from the correlations between the different features in different layers of the model, where the expectation is taken over the spatial extent of the input image. Texture/style is a statistical relationship between the pixels of a source image, which is assumed to have a stationary distribution at some scale. Therefore, all the conv outputs can be used to compute the style loss.
# find the loss for each style
if name in style_layers:
    style_loss = []
    for index, style_image in enumerate(style_imgs):
        target_feature = model(style_image).detach()
        style_loss_index = StyleLoss(target_feature)
        model.add_module("style_loss_{}_{}".format(i, index), style_loss_index)
        style_loss.append(style_loss_index)
    style_losses.append(style_loss)

# combine using the style weights
for sl in style_losses:
    for index, style_loss in enumerate(sl):
        style_score += style_coeffs[index] * style_loss.loss
Additionally, VGG networks are trained on images with each channel normalized by mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225]. We will use them to normalize the image before sending it into the network.
This approach is effective but not scalable, and it is impractical at the desired resolution. Johnson et al., in "Perceptual Losses for Real-Time Style Transfer and Super-Resolution", addressed this limitation by introducing a feedforward style transfer network trained to transform a content image into a stylized image in a single pass. However, it is not directly applicable to our problem, as the network is trained on a single style.
A Learned Representation for Artistic Style¶
Dumoulin et al. describe a method that scales easily to $N$ styles while remaining fast. A visual texture or style is conjectured to be spatially homogeneous: it consists of repeated structural motifs whose minimal sufficient statistics are captured by lower-order statistical measurements. The key idea is that styles likely share some degree of computation; for example, different artworks may have similar paint strokes but differ in color palette. They propose Conditional Instance Normalization (CIN), which allows all convolutional weights of a style transfer network to be shared across many styles: it is sufficient to tune the parameters of an affine transformation applied after normalization for each style. Scaling and shifting the normalized activations $x$ is all that is required to condition on a specific style.
$$z=\gamma_s\left(\frac{x-\mu}{\sigma}\right)+\beta_s$$
where $\mu$ and $\sigma$ are $x$'s mean and standard deviation across both the spatial axes.
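A minimal PyTorch sketch of conditional instance normalization (my own illustrative implementation of the equation above; the per-style parameters $\gamma_s$ and $\beta_s$ are stored as rows of two parameter matrices):

```python
import torch
import torch.nn as nn

class ConditionalInstanceNorm2d(nn.Module):
    def __init__(self, num_channels: int, num_styles: int):
        super().__init__()
        # one (gamma_s, beta_s) pair per style; the convolutional weights
        # elsewhere in the network are shared across all styles
        self.gamma = nn.Parameter(torch.ones(num_styles, num_channels))
        self.beta = nn.Parameter(torch.zeros(num_styles, num_channels))

    def forward(self, x: torch.Tensor, style: int) -> torch.Tensor:
        # normalize each channel across the spatial axes
        mu = x.mean(dim=(2, 3), keepdim=True)
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-5
        g = self.gamma[style].view(1, -1, 1, 1)
        b = self.beta[style].view(1, -1, 1, 1)
        return g * (x - mu) / sigma + b
```

Selecting a style is just an index into the parameter matrices, which is what makes adding an $(N+1)$-th style cheap.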
This method is demonstrated to generalize across a diversity of artistic styles, reducing a painting to a point in an embedding space and permitting a user to explore new painting styles by arbitrarily combining the styles learned from individual paintings.
Architecture¶

Image Transform Network¶

The Image Transform Network is a feedforward style transfer network, trained to map a content image to a stylized image in one pass. A pretrained VGG serves as the loss network, similar to the work of Gatys et al.
The Image Transform Network is a deep residual convolutional neural network. Instead of pooling layers, which cause information loss, strided convolutions are used for downsampling. Residual blocks allow for a deeper network, and nearest-neighbour upsampling is used instead of fractionally strided (transposed) convolution to prevent checkerboard artifacts.
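These building blocks can be sketched as follows (my own simplified modules illustrating the ideas above, not the exact architecture):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # residual blocks let the transform network go deeper
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels, affine=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels, affine=True),
        )

    def forward(self, x):
        return x + self.block(x)

class UpsampleConv(nn.Module):
    # nearest-neighbour upsampling followed by a plain convolution,
    # used instead of a fractionally strided (transposed) convolution
    # to avoid checkerboard artifacts
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='nearest')
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(self.up(x))
```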
Because of computational and time constraints, I am using the entire pretrained model, not just the VGG loss network. It has been trained on the 'varied' set of paintings.
Sources - https://github.com/magenta/magenta/tree/main/magenta/models/image_stylization
Next, define the utility functions for getting the stylized images from the pretrained model.
We can browse through the styles using the slider.
Now, suppose we want to narrow down to a few styles of interest. I am choosing the following without loss of generality.
import matplotlib.pyplot as plt

style_indices = [1, 31, 23, 8]
fig, axs = plt.subplots(2, 2, figsize=(15, 15))
plt.tight_layout()
for i in range(4):
    axs[i // 2, i % 2].imshow(varied_stylized[style_indices[i]], interpolation='nearest')
    axs[i // 2, i % 2].axis('off')
How to find the right mix?¶
Deciding what weight to give each individual style can become intractable: only a small region of the space of weight vectors may generate useful, pleasant stylized images. We can recognize the desired behaviour when we see it, but not necessarily demonstrate it directly. This motivates reinforcement learning from human feedback.
Reinforcement Learning from Human Feedback (RLHF)¶
RLHF was used by OpenAI to significantly improve their GPT models. Inspired by this, I want to test whether the idea is relevant to our problem as well: we can leverage human feedback to learn a reward model that assigns a score to any particular style-weight vector, which can then be used to find the optimal set of style weights.
RLHF is very flexible, and there are multiple ways to implement it. I choose a very simple algorithm and treat the style transfer network as a black box; I don't want to delve into model training because of computational constraints.
Let us first see how we can use the human feedback to learn the reward model.
Steps
- Create a UI that presents a pair of images to a human labeler, who chooses a winner. Use this to construct a dataset of pairs $(y_w, y_l)$, where $y_w$ is the style-weight vector of the winning image and $y_l$ that of the losing image.
- Learn the reward model $r_\phi$

Source : https://huyenchip.com/2023/05/02/rlhf.html#phase_3_rlhf
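The reward model can be trained with the standard pairwise (Bradley–Terry) objective: maximize the log-probability that the winner's score exceeds the loser's. A sketch on toy data (the architecture and all names are my own choices):

```python
import torch
import torch.nn as nn

# reward model r_phi: maps a style-weight vector to a scalar score
def make_reward_model(num_styles: int) -> nn.Module:
    return nn.Sequential(
        nn.Linear(num_styles, 64),
        nn.ReLU(),
        nn.Linear(64, 1),
    )

def pairwise_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r(y_w) - r(y_l)), averaged over the batch
    return -torch.nn.functional.logsigmoid(r_w - r_l).mean()

reward_model = make_reward_model(num_styles=4)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# toy batch of (winner, loser) style-weight pairs
y_w = torch.rand(8, 4)
y_l = torch.rand(8, 4)

optimizer.zero_grad()
loss = pairwise_loss(reward_model(y_w), reward_model(y_l))
loss.backward()
optimizer.step()
```

One such step per labeled batch, repeated over the preference dataset, yields the reward model used below.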
After learning $r_\phi$, we can use standard RL algorithms to find the optimal weights. I will give a brief overview of this.
An RL agent interacts with the world (called its environment) by using a policy to choose, at every time step, an action from a set of actions. The environment responds by transitioning the agent to its next state and providing a reward attributed to the last action taken from the prior state.

A policy is a behaviour function that maps states to actions. We want to learn the optimal policy, i.e., the distribution over actions that yields the highest reward.
$$ a=\pi(s) $$
I model the problem by making the following assumptions:
- There is a single state: the initial content image. We want to learn the optimal policy, which yields a desirable stylized image.
- The action space $\mathcal{A}$ is continuous and multivariate, i.e., $a \in \mathbb{R}^N$ where $N$ is the number of styles; an action is a distribution over the styles.
We can directly use any of the standard RL algorithms.
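Since the learned reward model is differentiable in the style weights, the single-state setting also admits a very direct alternative to a full RL algorithm: gradient ascent on $r_\phi$ over the weight simplex. A sketch (the random `reward_model` here is a stand-in for a trained one; softmax keeps the action a valid distribution over styles):

```python
import torch
import torch.nn as nn

# stand-in for a trained reward model r_phi (assumption: any module
# mapping an N-dim weight vector to a scalar score)
reward_model = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 1))
for p in reward_model.parameters():
    p.requires_grad_(False)

# unconstrained logits; softmax maps them onto the simplex so the
# action stays a distribution over the 4 styles
logits = torch.zeros(4, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)

for _ in range(200):
    optimizer.zero_grad()
    weights = torch.softmax(logits, dim=0)
    # ascend the reward = descend its negation
    loss = -reward_model(weights)
    loss.backward()
    optimizer.step()

best_weights = torch.softmax(logits, dim=0).detach()
```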
Note¶
I will not be implementing the RL algorithm to find the optimal weights because of practical issues. I am not sure RL will be very useful given there is only a single state; furthermore, these algorithms are notoriously data-hungry and unstable, and we simply can't generate huge datasets quickly. Direct Preference Optimization could, however, come in handy here.
I will only implement the reward model and train it on a toy dataset.
RLHF UI¶
The images load slowly, so please wait for the Choose message before clicking the buttons.
Limitations¶
- We learn the stylized image indirectly through the individual layer embeddings; it would be faster to learn directly from the style and content images.
- This method does not leverage current generative methods, which are bound to give much higher-quality images.